Challenge understanding

Objective

Predict which passengers survived, using the Titanic dataset.

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. https://www.kaggle.com/c/titanic


Initial Idea

  1. Load Library Modules
  2. Load Datasets
  3. Explore datasets
  4. Analyse relations between features
  5. Analyse missing values
  6. Analyse features
  7. Prepare for modelling
  8. Modelling
  9. Prepare the prediction for submission

1. Loading Library Modules


In [13]:
import warnings
warnings.filterwarnings('ignore')

# SKLearn Model Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression , Perceptron

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

# SKLearn ensemble classifiers
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier , BaggingClassifier
from sklearn.ensemble import VotingClassifier , AdaBoostClassifier

# SKLearn Modelling Helpers
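# Imputer was removed from sklearn.preprocessing (scikit-learn 0.22); SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer , scale
# sklearn.cross_validation was removed (scikit-learn 0.20); model_selection replaces it
from sklearn.model_selection import train_test_split , StratifiedKFold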
from sklearn.feature_selection import RFECV

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Project-local helper module (pltFunctions.py): plotting and feature helpers used below
import pltFunctions as pfunc

# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
sns.set_style( 'white' )
pylab.rcParams[ 'figure.figsize' ] = 8 , 6

2. Loading Datasets


In [14]:
train = pd.read_csv("./input/train.csv")
test    = pd.read_csv("./input/test.csv")

In [15]:
#combined = pd.concat([train.drop('Survived',1),test])
#combined = train.append( test, ignore_index = True)
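# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# sort=True keeps the alphabetical column order shown in the outputs below.
full = pd.concat( [train, test], ignore_index = True, sort = True )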
del train, test
#train = full[ :891 ]
#combined = combined.drop( 'Survived',1)

In [16]:
#print ('Datasets:' , 'combined:' , combined.shape , 'full:' , full.shape , 'train:' , train.shape)

3. Exploring datasets


In [17]:
full.head(10)


Out[17]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450
5 NaN NaN Q 8.4583 Moran, Mr. James 0 6 3 male 0 0.0 330877
6 54.0 E46 S 51.8625 McCarthy, Mr. Timothy J 0 7 1 male 0 0.0 17463
7 2.0 NaN S 21.0750 Palsson, Master. Gosta Leonard 1 8 3 male 3 0.0 349909
8 27.0 NaN S 11.1333 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 2 9 3 female 0 1.0 347742
9 14.0 NaN C 30.0708 Nasser, Mrs. Nicholas (Adele Achem) 0 10 2 female 1 1.0 237736

In [18]:
print(full.isnull().sum())


Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64
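
The counts above are easier to compare as fractions of the 1309 rows: Cabin is missing in roughly 77% of rows (1014/1309) and Age in roughly 20% (263/1309), while the 418 missing Survived values are simply the test rows. A minimal sketch of that calculation on a made-up toy frame (the `toy` data and the `missing_share` name are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full`; the values are invented for illustration.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `full.isnull().mean()`.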

In [19]:
pd.crosstab(full['Pclass'], full['Sex'])


Out[19]:
Sex female male
Pclass
1 144 179
2 106 171
3 216 493

In [20]:
# Mean age per (Sex, Pclass) group, computed once and reused
agedf = full.groupby(['Sex','Pclass'])['Age'].mean()
print( agedf )
type( agedf )


Sex     Pclass
female  1         37.037594
        2         27.499223
        3         22.185329
male    1         41.029272
        2         30.815380
        3         25.962264
Name: Age, dtype: float64
Out[20]:
pandas.core.series.Series

In [21]:
#for age in full:
#    if full['Age'].isnull():
#        print (agedf.where(agedf['Sex'] == full['Sex'])&(agedf['Pclass']==full['Pclass']))

In [22]:
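def fillMissingAge(dframe):
    # Replace missing ages with the overall mean age (ignores Sex and Pclass)
    dframe['Age'] = dframe['Age'].fillna( dframe['Age'].mean())
    return dframe

def fillMissingFare(dframe):
    # Replace the missing fare with the overall mean fare
    dframe['Fare'] = dframe['Fare'].fillna( dframe['Fare'].mean() )
    return dframe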
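
The group means printed in In [20] differ noticeably by Sex and Pclass (about 22 for third-class women versus about 41 for first-class men), so one possible refinement is to fill each missing age with the mean of its own group rather than the overall mean. The following is only a sketch under that assumption; `fill_missing_age_by_group` and the toy data are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

def fill_missing_age_by_group(dframe):
    """Hypothetical helper: return a copy where missing Age values are
    replaced by the mean Age of the row's (Sex, Pclass) group."""
    out = dframe.copy()
    # transform("mean") returns a Series aligned with the rows of `out`
    group_mean = out.groupby(["Sex", "Pclass"])["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(group_mean)
    # A group with no known age at all stays NaN; fall back to the overall mean
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    return out

# Invented toy data for illustration
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 36.0],
})
filled = fill_missing_age_by_group(toy)
print(filled)
```

Unlike `fillMissingAge`, this variant works on a copy, so the input frame is left unchanged.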

In [23]:
full = fillMissingAge(full)
full = fillMissingFare(full)
print(full.isnull().sum())


Age               0
Cabin          1014
Embarked          2
Fare              0
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

In [24]:
print(full[full['Embarked'].isnull()])


      Age Cabin Embarked  Fare                                       Name  \
61   38.0   B28      NaN  80.0                        Icard, Miss. Amelie   
829  62.0   B28      NaN  80.0  Stone, Mrs. George Nelson (Martha Evelyn)   

     Parch  PassengerId  Pclass     Sex  SibSp  Survived  Ticket  
61       0           62       1  female      0       1.0  113572  
829      0          830       1  female      0       1.0  113572  

In [25]:
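# Sex still holds the strings 'male'/'female' at this point (it is converted
# to 0/1 only in In [35]), so select women with 'female' rather than 1.
pd.crosstab(full['Embarked'], full['Sex'].where(full['Sex'] == 'female'))

In [26]:
# Boolean indexing drops the non-matching rows instead of masking them with NaN
full[(full['Sex']=='female') & (full['Pclass']==1)].groupby(['Embarked','Pclass','Parch','SibSp']).size()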

In [27]:
nt=(115+60+291)
pC=115/nt
pQ=60/nt
pS=291/nt
print('Prob C :', pC, 'Prob Q :', pQ ,'Prob S :' , pS)

nC=(30+2+20)
p0C=30/nC
p0Q=2/nC
p0S=20/nC
print('Prob C :', p0C, 'Prob Q :', p0Q ,'Prob S :' , p0S)

print( 'Sum of probabilities')
print('Prob C :', pC+p0C, 'Prob Q :', pQ+p0Q ,'Prob S :' , pS+p0S)


Prob C : 0.24678111587982832 Prob Q : 0.12875536480686695 Prob S : 0.6244635193133047
Prob C : 0.5769230769230769 Prob Q : 0.038461538461538464 Prob S : 0.38461538461538464
Sum of probabilities
Prob C : 0.8237041928029052 Prob Q : 0.1672169032684054 Prob S : 1.0090789039286894
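
Each of the first two printed rows already sums to 1, so the added values are best read as a rough score per port rather than as probabilities (the three sums total 2). As an alternative, the proportions can be read directly from the data for passengers who resemble the two with a missing port. A minimal sketch on invented toy data (the `toy` frame and the names `subset`, `port_share` and `most_likely_port` are illustrative assumptions):

```python
import pandas as pd

# Invented toy data; the last row has no known port of embarkation
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "female"],
    "Pclass":   [1, 1, 1, 3, 1, 1],
    "Embarked": ["C", "S", "C", "Q", "S", None],
})

# Keep only first-class women, the group the two passengers belong to
subset = toy[(toy["Sex"] == "female") & (toy["Pclass"] == 1)]

# value_counts(normalize=True) ignores missing values and returns proportions
port_share = subset["Embarked"].value_counts(normalize=True)
most_likely_port = port_share.idxmax()
print(port_share)
print("Most frequent port in this group:", most_likely_port)
```

Which subgroup to condition on (sex, class, fare, family size) is a judgment call; the notebook itself settles on S.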

In [28]:
# Trying S for both passengers
# .loc assigns on the frame itself; chained ['Embarked'].iloc[...] = ... may write to a copy
full.loc[[61, 829], 'Embarked'] = "S"

In [29]:
print(full.isnull().sum())


Age               0
Cabin          1014
Embarked          0
Fare              0
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

In [30]:
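def fillCabin(dframe):
    # Note: overwrites dframe['Cabin'] in place with the deck letter ('U' = unknown)
    dframe[ 'Cabin' ] = dframe['Cabin'].fillna( 'U' )
    dframe[ 'Cabin' ] = dframe[ 'Cabin' ].map( lambda c : c[0] )
    # One-hot encode the deck letter; only the dummy columns are returned
    dummies = pd.get_dummies( dframe['Cabin'] , prefix = 'Cabin' )
    return dummies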

In [31]:
# Build the dummy columns once, then print and attach them
newDF = fillCabin(full)
print(newDF)
full = pd.concat([full, newDF], axis=1)
#full = full.drop('Cabin',1)


      Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  \
0           0        0        0        0        0        0        0        0   
1           0        0        1        0        0        0        0        0   
2           0        0        0        0        0        0        0        0   
3           0        0        1        0        0        0        0        0   
4           0        0        0        0        0        0        0        0   
5           0        0        0        0        0        0        0        0   
6           0        0        0        0        1        0        0        0   
7           0        0        0        0        0        0        0        0   
8           0        0        0        0        0        0        0        0   
9           0        0        0        0        0        0        0        0   
10          0        0        0        0        0        0        1        0   
11          0        0        1        0        0        0        0        0   
12          0        0        0        0        0        0        0        0   
13          0        0        0        0        0        0        0        0   
14          0        0        0        0        0        0        0        0   
15          0        0        0        0        0        0        0        0   
16          0        0        0        0        0        0        0        0   
17          0        0        0        0        0        0        0        0   
18          0        0        0        0        0        0        0        0   
19          0        0        0        0        0        0        0        0   
20          0        0        0        0        0        0        0        0   
21          0        0        0        1        0        0        0        0   
22          0        0        0        0        0        0        0        0   
23          1        0        0        0        0        0        0        0   
24          0        0        0        0        0        0        0        0   
25          0        0        0        0        0        0        0        0   
26          0        0        0        0        0        0        0        0   
27          0        0        1        0        0        0        0        0   
28          0        0        0        0        0        0        0        0   
29          0        0        0        0        0        0        0        0   
...       ...      ...      ...      ...      ...      ...      ...      ...   
1279        0        0        0        0        0        0        0        0   
1280        0        0        0        0        0        0        0        0   
1281        0        1        0        0        0        0        0        0   
1282        0        0        0        1        0        0        0        0   
1283        0        0        0        0        0        0        0        0   
1284        0        0        0        0        0        0        0        0   
1285        0        0        0        0        0        0        0        0   
1286        0        0        1        0        0        0        0        0   
1287        0        0        0        0        0        0        0        0   
1288        0        1        0        0        0        0        0        0   
1289        0        0        0        0        0        0        0        0   
1290        0        0        0        0        0        0        0        0   
1291        0        0        1        0        0        0        0        0   
1292        0        0        0        0        0        0        0        0   
1293        0        0        0        0        0        0        0        0   
1294        0        0        0        0        0        0        0        0   
1295        0        0        0        1        0        0        0        0   
1296        0        0        0        1        0        0        0        0   
1297        0        0        0        0        0        0        0        0   
1298        0        0        1        0        0        0        0        0   
1299        0        0        0        0        0        0        0        0   
1300        0        0        0        0        0        0        0        0   
1301        0        0        0        0        0        0        0        0   
1302        0        0        1        0        0        0        0        0   
1303        0        0        0        0        0        0        0        0   
1304        0        0        0        0        0        0        0        0   
1305        0        0        1        0        0        0        0        0   
1306        0        0        0        0        0        0        0        0   
1307        0        0        0        0        0        0        0        0   
1308        0        0        0        0        0        0        0        0   

      Cabin_U  
0           1  
1           0  
2           1  
3           0  
4           1  
5           1  
6           0  
7           1  
8           1  
9           1  
10          0  
11          0  
12          1  
13          1  
14          1  
15          1  
16          1  
17          1  
18          1  
19          1  
20          1  
21          0  
22          1  
23          0  
24          1  
25          1  
26          1  
27          0  
28          1  
29          1  
...       ...  
1279        1  
1280        1  
1281        0  
1282        0  
1283        1  
1284        1  
1285        1  
1286        0  
1287        1  
1288        0  
1289        1  
1290        1  
1291        0  
1292        1  
1293        1  
1294        1  
1295        0  
1296        0  
1297        1  
1298        0  
1299        1  
1300        1  
1301        1  
1302        0  
1303        1  
1304        1  
1305        0  
1306        1  
1307        1  
1308        1  

[1309 rows x 9 columns]
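
`fillCabin` both rewrites the `Cabin` column of its argument and returns the dummy columns. If the side effect is not wanted, a variant that leaves the input untouched could look like the sketch below; `cabin_deck_dummies` and the toy frame are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

def cabin_deck_dummies(dframe):
    """Hypothetical helper: one-hot encode the deck letter of Cabin
    ('U' for unknown) without modifying dframe."""
    # .str[0] takes the first character of each cabin string
    deck = dframe["Cabin"].fillna("U").str[0]
    return pd.get_dummies(deck, prefix="Cabin")

# Invented toy data for illustration
toy = pd.DataFrame({"Cabin": [np.nan, "C85", "E46", np.nan, "C123"]})
dummies = cabin_deck_dummies(toy)
print(dummies)
```

Depending on the pandas version the dummy columns are `uint8` or `bool`; both work as model inputs after casting.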

In [32]:
full


Out[32]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp ... Ticket Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 22.000000 U S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 ... A/5 21171 0 0 0 0 0 0 0 0 1
1 38.000000 C C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 ... PC 17599 0 0 1 0 0 0 0 0 0
2 26.000000 U S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 ... STON/O2. 3101282 0 0 0 0 0 0 0 0 1
3 35.000000 C S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 ... 113803 0 0 1 0 0 0 0 0 0
4 35.000000 U S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 ... 373450 0 0 0 0 0 0 0 0 1
5 29.881138 U Q 8.4583 Moran, Mr. James 0 6 3 male 0 ... 330877 0 0 0 0 0 0 0 0 1
6 54.000000 E S 51.8625 McCarthy, Mr. Timothy J 0 7 1 male 0 ... 17463 0 0 0 0 1 0 0 0 0
7 2.000000 U S 21.0750 Palsson, Master. Gosta Leonard 1 8 3 male 3 ... 349909 0 0 0 0 0 0 0 0 1
8 27.000000 U S 11.1333 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 2 9 3 female 0 ... 347742 0 0 0 0 0 0 0 0 1
9 14.000000 U C 30.0708 Nasser, Mrs. Nicholas (Adele Achem) 0 10 2 female 1 ... 237736 0 0 0 0 0 0 0 0 1
10 4.000000 G S 16.7000 Sandstrom, Miss. Marguerite Rut 1 11 3 female 1 ... PP 9549 0 0 0 0 0 0 1 0 0
11 58.000000 C S 26.5500 Bonnell, Miss. Elizabeth 0 12 1 female 0 ... 113783 0 0 1 0 0 0 0 0 0
12 20.000000 U S 8.0500 Saundercock, Mr. William Henry 0 13 3 male 0 ... A/5. 2151 0 0 0 0 0 0 0 0 1
13 39.000000 U S 31.2750 Andersson, Mr. Anders Johan 5 14 3 male 1 ... 347082 0 0 0 0 0 0 0 0 1
14 14.000000 U S 7.8542 Vestrom, Miss. Hulda Amanda Adolfina 0 15 3 female 0 ... 350406 0 0 0 0 0 0 0 0 1
15 55.000000 U S 16.0000 Hewlett, Mrs. (Mary D Kingcome) 0 16 2 female 0 ... 248706 0 0 0 0 0 0 0 0 1
16 2.000000 U Q 29.1250 Rice, Master. Eugene 1 17 3 male 4 ... 382652 0 0 0 0 0 0 0 0 1
17 29.881138 U S 13.0000 Williams, Mr. Charles Eugene 0 18 2 male 0 ... 244373 0 0 0 0 0 0 0 0 1
18 31.000000 U S 18.0000 Vander Planke, Mrs. Julius (Emelia Maria Vande... 0 19 3 female 1 ... 345763 0 0 0 0 0 0 0 0 1
19 29.881138 U C 7.2250 Masselmani, Mrs. Fatima 0 20 3 female 0 ... 2649 0 0 0 0 0 0 0 0 1
20 35.000000 U S 26.0000 Fynney, Mr. Joseph J 0 21 2 male 0 ... 239865 0 0 0 0 0 0 0 0 1
21 34.000000 D S 13.0000 Beesley, Mr. Lawrence 0 22 2 male 0 ... 248698 0 0 0 1 0 0 0 0 0
22 15.000000 U Q 8.0292 McGowan, Miss. Anna "Annie" 0 23 3 female 0 ... 330923 0 0 0 0 0 0 0 0 1
23 28.000000 A S 35.5000 Sloper, Mr. William Thompson 0 24 1 male 0 ... 113788 1 0 0 0 0 0 0 0 0
24 8.000000 U S 21.0750 Palsson, Miss. Torborg Danira 1 25 3 female 3 ... 349909 0 0 0 0 0 0 0 0 1
25 38.000000 U S 31.3875 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... 5 26 3 female 1 ... 347077 0 0 0 0 0 0 0 0 1
26 29.881138 U C 7.2250 Emir, Mr. Farred Chehab 0 27 3 male 0 ... 2631 0 0 0 0 0 0 0 0 1
27 19.000000 C S 263.0000 Fortune, Mr. Charles Alexander 2 28 1 male 3 ... 19950 0 0 1 0 0 0 0 0 0
28 29.881138 U Q 7.8792 O'Dwyer, Miss. Ellen "Nellie" 0 29 3 female 0 ... 330959 0 0 0 0 0 0 0 0 1
29 29.881138 U S 7.8958 Todoroff, Mr. Lalio 0 30 3 male 0 ... 349216 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1279 21.000000 U Q 7.7500 Canavan, Mr. Patrick 0 1280 3 male 0 ... 364858 0 0 0 0 0 0 0 0 1
1280 6.000000 U S 21.0750 Palsson, Master. Paul Folke 1 1281 3 male 3 ... 349909 0 0 0 0 0 0 0 0 1
1281 23.000000 B S 93.5000 Payne, Mr. Vivian Ponsonby 0 1282 1 male 0 ... 12749 0 1 0 0 0 0 0 0 0
1282 51.000000 D S 39.4000 Lines, Mrs. Ernest H (Elizabeth Lindsey James) 1 1283 1 female 0 ... PC 17592 0 0 0 1 0 0 0 0 0
1283 13.000000 U S 20.2500 Abbott, Master. Eugene Joseph 2 1284 3 male 0 ... C.A. 2673 0 0 0 0 0 0 0 0 1
1284 47.000000 U S 10.5000 Gilbert, Mr. William 0 1285 2 male 0 ... C.A. 30769 0 0 0 0 0 0 0 0 1
1285 29.000000 U S 22.0250 Kink-Heilmann, Mr. Anton 1 1286 3 male 3 ... 315153 0 0 0 0 0 0 0 0 1
1286 18.000000 C S 60.0000 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) 0 1287 1 female 1 ... 13695 0 0 1 0 0 0 0 0 0
1287 24.000000 U Q 7.2500 Colbert, Mr. Patrick 0 1288 3 male 0 ... 371109 0 0 0 0 0 0 0 0 1
1288 48.000000 B C 79.2000 Frolicher-Stehli, Mrs. Maxmillian (Margaretha ... 1 1289 1 female 1 ... 13567 0 1 0 0 0 0 0 0 0
1289 22.000000 U S 7.7750 Larsson-Rondberg, Mr. Edvard A 0 1290 3 male 0 ... 347065 0 0 0 0 0 0 0 0 1
1290 31.000000 U Q 7.7333 Conlon, Mr. Thomas Henry 0 1291 3 male 0 ... 21332 0 0 0 0 0 0 0 0 1
1291 30.000000 C S 164.8667 Bonnell, Miss. Caroline 0 1292 1 female 0 ... 36928 0 0 1 0 0 0 0 0 0
1292 38.000000 U S 21.0000 Gale, Mr. Harry 0 1293 2 male 1 ... 28664 0 0 0 0 0 0 0 0 1
1293 22.000000 U C 59.4000 Gibson, Miss. Dorothy Winifred 1 1294 1 female 0 ... 112378 0 0 0 0 0 0 0 0 1
1294 17.000000 U S 47.1000 Carrau, Mr. Jose Pedro 0 1295 1 male 0 ... 113059 0 0 0 0 0 0 0 0 1
1295 43.000000 D C 27.7208 Frauenthal, Mr. Isaac Gerald 0 1296 1 male 1 ... 17765 0 0 0 1 0 0 0 0 0
1296 20.000000 D C 13.8625 Nourney, Mr. Alfred (Baron von Drachstedt")" 0 1297 2 male 0 ... SC/PARIS 2166 0 0 0 1 0 0 0 0 0
1297 23.000000 U S 10.5000 Ware, Mr. William Jeffery 0 1298 2 male 1 ... 28666 0 0 0 0 0 0 0 0 1
1298 50.000000 C C 211.5000 Widener, Mr. George Dunton 1 1299 1 male 1 ... 113503 0 0 1 0 0 0 0 0 0
1299 29.881138 U Q 7.7208 Riordan, Miss. Johanna Hannah"" 0 1300 3 female 0 ... 334915 0 0 0 0 0 0 0 0 1
1300 3.000000 U S 13.7750 Peacock, Miss. Treasteall 1 1301 3 female 1 ... SOTON/O.Q. 3101315 0 0 0 0 0 0 0 0 1
1301 29.881138 U Q 7.7500 Naughton, Miss. Hannah 0 1302 3 female 0 ... 365237 0 0 0 0 0 0 0 0 1
1302 37.000000 C Q 90.0000 Minahan, Mrs. William Edward (Lillian E Thorpe) 0 1303 1 female 1 ... 19928 0 0 1 0 0 0 0 0 0
1303 28.000000 U S 7.7750 Henriksson, Miss. Jenny Lovisa 0 1304 3 female 0 ... 347086 0 0 0 0 0 0 0 0 1
1304 29.881138 U S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 ... A.5. 3236 0 0 0 0 0 0 0 0 1
1305 39.000000 C C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 ... PC 17758 0 0 1 0 0 0 0 0 0
1306 38.500000 U S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 ... SOTON/O.Q. 3101262 0 0 0 0 0 0 0 0 1
1307 29.881138 U S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 ... 359309 0 0 0 0 0 0 0 0 1
1308 29.881138 U C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 ... 2668 0 0 0 0 0 0 0 0 1

1309 rows × 21 columns


In [33]:
#print( full.where((full['Sex'] == 0) & (full['Pclass'] == 1)).groupby(['Pclass','Sex'])['Age'].mean() )
print( full['Sex'].isnull().sum() )


0

In [34]:
#byTicket = full.where(full['Cabin'].isnull()).groupby(['Name'])['Ticket']
#byFare = full.where(full['Cabin'].isnull()).groupby(['Pclass'])['Fare']
#byTicket.head(5)
#byFare.head(5)

In [35]:
full = pfunc.convertSexToNum(full)
full.head()


Out[35]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass SibSp Survived ... Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Sex
0 22.0 U S 7.2500 Braund, Mr. Owen Harris 0 1 3 1 0.0 ... 0 0 0 0 0 0 0 0 1 0
1 38.0 C C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 1 1.0 ... 0 0 1 0 0 0 0 0 0 1
2 26.0 U S 7.9250 Heikkinen, Miss. Laina 0 3 3 0 1.0 ... 0 0 0 0 0 0 0 0 1 1
3 35.0 C S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 1 1.0 ... 0 0 1 0 0 0 0 0 0 1
4 35.0 U S 8.0500 Allen, Mr. William Henry 0 5 3 0 0.0 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 21 columns
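
`pfunc.convertSexToNum` lives in the local `pltFunctions` module and is not shown here. Judging only from the output above (male rows show 0, female rows show 1), it presumably maps the strings to integers. A hypothetical stand-in, not the author's implementation:

```python
import pandas as pd

def convert_sex_to_num(dframe):
    """Hypothetical stand-in for pfunc.convertSexToNum:
    map 'male' -> 0 and 'female' -> 1 on a copy of the frame."""
    out = dframe.copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    return out

# Invented toy data for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})
converted = convert_sex_to_num(toy)
print(converted)
```

Any value other than 'male' or 'female' would become NaN with `map`, which makes unexpected categories easy to spot.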


In [36]:
# Name the Deck according to the Cabin description
# Use 'U' for the Deck when the Cabin description is unknown
full = pfunc.fillDeck(full)

pd.crosstab(full['Deck'], full['Survived'])


Out[36]:
Survived 0.0 1.0
Deck
A 8 7
B 12 35
C 24 35
D 8 25
E 8 24
F 5 8
G 2 2
T 1 0
U 481 206
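
The raw counts hide the survival rate per deck: on deck B, 35 of 47 passengers with a known outcome survived (about 74%), against 206 of 687 (about 30%) where the deck is unknown. Decks G and T have too few passengers to say much. Passing `normalize="index"` to `pd.crosstab` turns each row into proportions; a small sketch on invented data:

```python
import pandas as pd

# Invented toy data for illustration
toy = pd.DataFrame({
    "Deck":     ["B", "B", "B", "B", "U", "U", "U", "U"],
    "Survived": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# normalize="index" divides every row by its row total
rates = pd.crosstab(toy["Deck"], toy["Survived"], normalize="index")
print(rates)
```

On the notebook's data the equivalent call would be `pd.crosstab(full['Deck'], full['Survived'], normalize='index')`.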

In [37]:
print(full.isnull().sum())
print("========================================")
print(full.info())


Age              0
Cabin            0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
SibSp            0
Survived       418
Ticket           0
Cabin_A          0
Cabin_B          0
Cabin_C          0
Cabin_D          0
Cabin_E          0
Cabin_F          0
Cabin_G          0
Cabin_T          0
Cabin_U          0
Sex              0
Deck             0
dtype: int64
========================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 22 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
Cabin_A        1309 non-null uint8
Cabin_B        1309 non-null uint8
Cabin_C        1309 non-null uint8
Cabin_D        1309 non-null uint8
Cabin_E        1309 non-null uint8
Cabin_F        1309 non-null uint8
Cabin_G        1309 non-null uint8
Cabin_T        1309 non-null uint8
Cabin_U        1309 non-null uint8
Sex            1309 non-null int64
Deck           1309 non-null object
dtypes: float64(3), int64(5), object(5), uint8(9)
memory usage: 144.5+ KB
None

In [38]:
# Run the feature engineering once, then print the result
full = pfunc.featureEng( full )
print(full)
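
`pfunc.featureEng` is defined in the local `pltFunctions` module and is not shown. The columns visible in its output (`Title`, `TicketType`, `FamilyLarge`) suggest derivations along the following lines. This is a guess reconstructed from the printed values, not the author's code: the function name, the `FamilySize` column and the threshold of 5 are assumptions.

```python
import pandas as pd

def feature_eng_sketch(dframe, large_family_min=5):
    """Hypothetical sketch of features like those in the featureEng output.
    large_family_min is an assumed threshold, not taken from the notebook."""
    out = dframe.copy()
    # Title: the text between the comma and the first period in Name
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # TicketType: first character of the ticket string
    out["TicketType"] = out["Ticket"].str[0]
    # FamilySize (assumed name): siblings/spouses + parents/children + the passenger
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["FamilyLarge"] = (out["FamilySize"] >= large_family_min).astype(int)
    return out

# Three rows copied from the head of the dataset shown earlier
toy = pd.DataFrame({
    "Name":   ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
               "Palsson, Master. Gosta Leonard"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "349909"],
    "SibSp":  [1, 0, 3],
    "Parch":  [0, 0, 1],
})
engineered = feature_eng_sketch(toy)
print(engineered[["Title", "TicketType", "FamilySize", "FamilyLarge"]])
```

For these three rows the sketch produces the same `Title`, `TicketType` and `FamilyLarge` values as the real output, which is consistent with the guess but does not confirm it.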


            Age Cabin Embarked      Fare  \
0     22.000000     U        S    7.2500   
1     38.000000     C        C   71.2833   
2     26.000000     U        S    7.9250   
3     35.000000     C        S   53.1000   
4     35.000000     U        S    8.0500   
5     29.881138     U        Q    8.4583   
6     54.000000     E        S   51.8625   
7      2.000000     U        S   21.0750   
8     27.000000     U        S   11.1333   
9     14.000000     U        C   30.0708   
10     4.000000     G        S   16.7000   
11    58.000000     C        S   26.5500   
12    20.000000     U        S    8.0500   
13    39.000000     U        S   31.2750   
14    14.000000     U        S    7.8542   
15    55.000000     U        S   16.0000   
16     2.000000     U        Q   29.1250   
17    29.881138     U        S   13.0000   
18    31.000000     U        S   18.0000   
19    29.881138     U        C    7.2250   
20    35.000000     U        S   26.0000   
21    34.000000     D        S   13.0000   
22    15.000000     U        Q    8.0292   
23    28.000000     A        S   35.5000   
24     8.000000     U        S   21.0750   
25    38.000000     U        S   31.3875   
26    29.881138     U        C    7.2250   
27    19.000000     C        S  263.0000   
28    29.881138     U        Q    7.8792   
29    29.881138     U        S    7.8958   
...         ...   ...      ...       ...   
1279  21.000000     U        Q    7.7500   
1280   6.000000     U        S   21.0750   
1281  23.000000     B        S   93.5000   
1282  51.000000     D        S   39.4000   
1283  13.000000     U        S   20.2500   
1284  47.000000     U        S   10.5000   
1285  29.000000     U        S   22.0250   
1286  18.000000     C        S   60.0000   
1287  24.000000     U        Q    7.2500   
1288  48.000000     B        C   79.2000   
1289  22.000000     U        S    7.7750   
1290  31.000000     U        Q    7.7333   
1291  30.000000     C        S  164.8667   
1292  38.000000     U        S   21.0000   
1293  22.000000     U        C   59.4000   
1294  17.000000     U        S   47.1000   
1295  43.000000     D        C   27.7208   
1296  20.000000     D        C   13.8625   
1297  23.000000     U        S   10.5000   
1298  50.000000     C        C  211.5000   
1299  29.881138     U        Q    7.7208   
1300   3.000000     U        S   13.7750   
1301  29.881138     U        Q    7.7500   
1302  37.000000     C        Q   90.0000   
1303  28.000000     U        S    7.7750   
1304  29.881138     U        S    8.0500   
1305  39.000000     C        C  108.9000   
1306  38.500000     U        S    7.2500   
1307  29.881138     U        S    8.0500   
1308  29.881138     U        C   22.3583   

                                                   Name  Parch  PassengerId  \
0                               Braund, Mr. Owen Harris      0            1   
1     Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2   
2                                Heikkinen, Miss. Laina      0            3   
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)      0            4   
4                              Allen, Mr. William Henry      0            5   
5                                      Moran, Mr. James      0            6   
6                               McCarthy, Mr. Timothy J      0            7   
7                        Palsson, Master. Gosta Leonard      1            8   
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      2            9   
9                   Nasser, Mrs. Nicholas (Adele Achem)      0           10   
10                      Sandstrom, Miss. Marguerite Rut      1           11   
11                             Bonnell, Miss. Elizabeth      0           12   
12                       Saundercock, Mr. William Henry      0           13   
13                          Andersson, Mr. Anders Johan      5           14   
14                 Vestrom, Miss. Hulda Amanda Adolfina      0           15   
15                     Hewlett, Mrs. (Mary D Kingcome)       0           16   
16                                 Rice, Master. Eugene      1           17   
17                         Williams, Mr. Charles Eugene      0           18   
18    Vander Planke, Mrs. Julius (Emelia Maria Vande...      0           19   
19                              Masselmani, Mrs. Fatima      0           20   
20                                 Fynney, Mr. Joseph J      0           21   
21                                Beesley, Mr. Lawrence      0           22   
22                          McGowan, Miss. Anna "Annie"      0           23   
23                         Sloper, Mr. William Thompson      0           24   
24                        Palsson, Miss. Torborg Danira      1           25   
25    Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...      5           26   
26                              Emir, Mr. Farred Chehab      0           27   
27                       Fortune, Mr. Charles Alexander      2           28   
28                        O'Dwyer, Miss. Ellen "Nellie"      0           29   
29                                  Todoroff, Mr. Lalio      0           30   
...                                                 ...    ...          ...   
1279                               Canavan, Mr. Patrick      0         1280   
1280                        Palsson, Master. Paul Folke      1         1281   
1281                         Payne, Mr. Vivian Ponsonby      0         1282   
1282     Lines, Mrs. Ernest H (Elizabeth Lindsey James)      1         1283   
1283                      Abbott, Master. Eugene Joseph      2         1284   
1284                               Gilbert, Mr. William      0         1285   
1285                           Kink-Heilmann, Mr. Anton      1         1286   
1286     Smith, Mrs. Lucien Philip (Mary Eloise Hughes)      0         1287   
1287                               Colbert, Mr. Patrick      0         1288   
1288  Frolicher-Stehli, Mrs. Maxmillian (Margaretha ...      1         1289   
1289                     Larsson-Rondberg, Mr. Edvard A      0         1290   
1290                           Conlon, Mr. Thomas Henry      0         1291   
1291                            Bonnell, Miss. Caroline      0         1292   
1292                                    Gale, Mr. Harry      0         1293   
1293                     Gibson, Miss. Dorothy Winifred      1         1294   
1294                             Carrau, Mr. Jose Pedro      0         1295   
1295                       Frauenthal, Mr. Isaac Gerald      0         1296   
1296       Nourney, Mr. Alfred (Baron von Drachstedt")"      0         1297   
1297                          Ware, Mr. William Jeffery      0         1298   
1298                         Widener, Mr. George Dunton      1         1299   
1299                    Riordan, Miss. Johanna Hannah""      0         1300   
1300                          Peacock, Miss. Treasteall      1         1301   
1301                             Naughton, Miss. Hannah      0         1302   
1302    Minahan, Mrs. William Edward (Lillian E Thorpe)      0         1303   
1303                     Henriksson, Miss. Jenny Lovisa      0         1304   
1304                                 Spector, Mr. Woolf      0         1305   
1305                       Oliva y Ocana, Dona. Fermina      0         1306   
1306                       Saether, Mr. Simon Sivertsen      0         1307   
1307                                Ware, Mr. Frederick      0         1308   
1308                           Peter, Master. Michael J      1         1309   

      Pclass  SibSp  Survived      ...      FamilyLarge  TicketType   Title  \
0          3      1       0.0      ...                0           A      Mr   
1          1      1       1.0      ...                0           P     Mrs   
2          3      0       1.0      ...                0           S    Miss   
3          1      1       1.0      ...                0           1     Mrs   
4          3      0       0.0      ...                0           3      Mr   
5          3      0       0.0      ...                0           3      Mr   
6          1      0       0.0      ...                0           1      Mr   
7          3      3       0.0      ...                1           3  Master   
8          3      0       1.0      ...                0           3     Mrs   
9          2      1       1.0      ...                0           2     Mrs   
10         3      1       1.0      ...                0           P    Miss   
11         1      0       1.0      ...                0           1    Miss   
12         3      0       0.0      ...                0           A      Mr   
13         3      1       0.0      ...                1           3      Mr   
14         3      0       0.0      ...                0           3    Miss   
15         2      0       1.0      ...                0           2     Mrs   
16         3      4       0.0      ...                1           3  Master   
17         2      0       1.0      ...                0           2      Mr   
18         3      1       0.0      ...                0           3     Mrs   
19         3      0       1.0      ...                0           2     Mrs   
20         2      0       0.0      ...                0           2      Mr   
21         2      0       1.0      ...                0           2      Mr   
22         3      0       1.0      ...                0           3    Miss   
23         1      0       1.0      ...                0           1      Mr   
24         3      3       0.0      ...                1           3    Miss   
25         3      1       1.0      ...                1           3     Mrs   
26         3      0       0.0      ...                0           2      Mr   
27         1      3       0.0      ...                1           1      Mr   
28         3      0       1.0      ...                0           3    Miss   
29         3      0       0.0      ...                0           3      Mr   
...      ...    ...       ...      ...              ...         ...     ...   
1279       3      0       NaN      ...                0           3      Mr   
1280       3      3       NaN      ...                1           3  Master   
1281       1      0       NaN      ...                0           1      Mr   
1282       1      0       NaN      ...                0           P     Mrs   
1283       3      0       NaN      ...                0           C  Master   
1284       2      0       NaN      ...                0           C      Mr   
1285       3      3       NaN      ...                1           3      Mr   
1286       1      1       NaN      ...                0           1     Mrs   
1287       3      0       NaN      ...                0           3      Mr   
1288       1      1       NaN      ...                0           1     Mrs   
1289       3      0       NaN      ...                0           3      Mr   
1290       3      0       NaN      ...                0           2      Mr   
1291       1      0       NaN      ...                0           3    Miss   
1292       2      1       NaN      ...                0           2      Mr   
1293       1      0       NaN      ...                0           1    Miss   
1294       1      0       NaN      ...                0           1      Mr   
1295       1      1       NaN      ...                0           1      Mr   
1296       2      0       NaN      ...                0           S      Mr   
1297       2      1       NaN      ...                0           2      Mr   
1298       1      1       NaN      ...                0           1      Mr   
1299       3      0       NaN      ...                0           3    Miss   
1300       3      1       NaN      ...                0           S    Miss   
1301       3      0       NaN      ...                0           3    Miss   
1302       1      1       NaN      ...                0           1     Mrs   
1303       3      0       NaN      ...                0           3    Miss   
1304       3      0       NaN      ...                0           A      Mr   
1305       1      0       NaN      ...                0           P    Dona   
1306       3      0       NaN      ...                0           S      Mr   
1307       3      0       NaN      ...                0           3      Mr   
1308       3      1       NaN      ...                0           2  Master   

      Fare_cat  Bad_ticket  Young  Shared_ticket  Ticket_group   Fare_eff  \
0            0        True   True              0             1   7.250000   
1            1       False  False              1             2  35.641650   
2            0       False   True              0             1   7.925000   
3            1       False  False              1             2  26.550000   
4            0        True  False              0             1   8.050000   
5            0        True   True              0             1   8.458300   
6            1       False  False              1             2  25.931250   
7            1        True   True              1             5   4.215000   
8            1        True   True              1             3   3.711100   
9            1       False   True              1             2  15.035400   
10           1       False   True              1             3   5.566667   
11           1       False   True              0             1  26.550000   
12           0        True   True              0             1   8.050000   
13           1        True  False              1             7   4.467857   
14           0        True   True              0             1   7.854200   
15           1       False  False              0             1  16.000000   
16           1        True   True              1             6   4.854167   
17           1       False   True              0             1  13.000000   
18           1        True  False              1             2   9.000000   
19           0       False   True              0             1   7.225000   
20           1       False  False              1             2  13.000000   
21           1       False  False              0             1  13.000000   
22           0        True   True              0             1   8.029200   
23           1       False   True              0             1  35.500000   
24           1        True   True              1             5   4.215000   
25           1        True  False              1             7   4.483929   
26           0       False   True              0             1   7.225000   
27           2       False   True              1             6  43.833333   
28           0        True   True              0             1   7.879200   
29           0        True   True              0             1   7.895800   
...        ...         ...    ...            ...           ...        ...   
1279         0        True   True              0             1   7.750000   
1280         1        True   True              1             5   4.215000   
1281         1       False   True              1             4  23.375000   
1282         1       False  False              1             2  19.700000   
1283         1       False   True              1             3   6.750000   
1284         1       False  False              0             1  10.500000   
1285         1        True   True              1             3   7.341667   
1286         1       False   True              1             2  30.000000   
1287         0        True   True              0             1   7.250000   
1288         1       False  False              1             2  39.600000   
1289         0        True   True              0             1   7.775000   
1290         0       False  False              0             1   7.733300   
1291         2        True   True              1             4  41.216675   
1292         1       False  False              1             2  10.500000   
1293         1       False   True              1             2  29.700000   
1294         1       False   True              1             2  23.550000   
1295         1       False  False              0             1  27.720800   
1296         1       False   True              0             1  13.862500   
1297         1       False   True              0             1  10.500000   
1298         2       False  False              1             5  42.300000   
1299         0        True   True              0             1   7.720800   
1300         1       False   True              1             3   4.591667   
1301         0        True   True              0             1   7.750000   
1302         1       False  False              1             3  30.000000   
1303         0        True   True              0             1   7.775000   
1304         0        True   True              0             1   8.050000   
1305         2       False  False              1             3  36.300000   
1306         0       False  False              0             1   7.250000   
1307         0        True   True              0             1   8.050000   
1308         1       False   True              1             3   7.452767   

      Fare_eff_cat  
0                0  
1                2  
2                0  
3                2  
4                0  
5                0  
6                2  
7                0  
8                0  
9                1  
10               0  
11               2  
12               0  
13               0  
14               0  
15               1  
16               0  
17               1  
18               1  
19               0  
20               1  
21               1  
22               0  
23               2  
24               0  
25               0  
26               0  
27               2  
28               0  
29               0  
...            ...  
1279             0  
1280             0  
1281             2  
1282             2  
1283             0  
1284             1  
1285             0  
1286             2  
1287             0  
1288             2  
1289             0  
1290             0  
1291             2  
1292             1  
1293             2  
1294             2  
1295             2  
1296             1  
1297             1  
1298             2  
1299             0  
1300             0  
1301             0  
1302             2  
1303             0  
1304             0  
1305             2  
1306             0  
1307             0  
1308             0  

[1309 rows x 38 columns]

In [39]:
#pfunc.pltCorrel( combined )
#pfunc.pltCorrel( full )

Correlations to Investigate

Pclass is correlated with Fare (1st-class tickets would be more expensive than tickets in the other classes)

Pclass x Age

SibSp x Age

SibSp x Fare

SibSp is correlated with Parch (large families would have high Parch values, while solo travellers would have zero)

Pclass is noticeably correlated with Survived (as expected, passengers in higher classes were more likely to survive)
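The pairs listed above can also be checked numerically rather than only by eye. The sketch below is one possible way to do it with pandas; the helper name `correlation_pairs` is an assumption (it is not part of `pltFunctions`), and the sample frame holds only the first five rows shown in this notebook, so the coefficients it yields are illustrative and not a result for the full dataset.

```python
import pandas as pd


def correlation_pairs(df, min_abs=0.0):
    """Return (col_a, col_b, r) tuples sorted by decreasing |r|.

    Hypothetical helper: only numeric columns are considered, and pairs
    whose Pearson correlation is undefined (NaN) are skipped.
    """
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= min_abs:
                pairs.append((a, b, float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# First five rows of the combined frame as printed in this notebook.
sample = pd.DataFrame({
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Pclass":   [3, 1, 3, 1, 3],
    "SibSp":    [1, 1, 0, 1, 0],
    "Survived": [0.0, 1.0, 1.0, 1.0, 0.0],
})

pairs = correlation_pairs(sample)
for a, b, r in pairs:
    print(f"{a:>8} x {b:<8} r = {r:+.2f}")
```

On the real data, the same call on `full` (before `Survived` is dropped) would list every numeric pair, which makes it easy to confirm or reject the candidates above.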


In [40]:
# Plot distributions of Age of passengers who survived or did not survive
#pfunc.pltDistro( train , var = 'Age' , target = 'Survived' , row = 'Sex' )

In [41]:
# Plot distributions of Survived by Pclass, one row per Sex
#pfunc.pltDistro( train , var = 'Survived' , target = 'Pclass' , row = 'Sex' )

In [42]:
# Plot distributions of Parch of passengers who survived or did not survive
#pfunc.pltDistro( train , var = 'Parch' , target = 'Survived' , row = 'Sex' )

In [43]:
full.head(5)


Out[43]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass SibSp Survived ... FamilyLarge TicketType Title Fare_cat Bad_ticket Young Shared_ticket Ticket_group Fare_eff Fare_eff_cat
0 22.0 U S 7.2500 Braund, Mr. Owen Harris 0 1 3 1 0.0 ... 0 A Mr 0 True True 0 1 7.25000 0
1 38.0 C C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 1 1.0 ... 0 P Mrs 1 False False 1 2 35.64165 2
2 26.0 U S 7.9250 Heikkinen, Miss. Laina 0 3 3 0 1.0 ... 0 S Miss 0 False True 0 1 7.92500 0
3 35.0 C S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 1 1.0 ... 0 1 Mrs 1 False False 1 2 26.55000 2
4 35.0 U S 8.0500 Allen, Mr. William Henry 0 5 3 0 0.0 ... 0 3 Mr 0 True False 0 1 8.05000 0

5 rows × 38 columns


In [49]:
# Plot survival by category (Embarked, Pclass, Sex, Parch, SibSp) and the Age distribution by Sex

#pfunc.pltCategories( train , cat = 'Embarked' , target = 'Survived' ) 
#pfunc.pltCategories( train , cat = 'Pclass' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'Sex' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'Parch' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'SibSp' , target = 'Survived' )
#pfunc.pltDistro( train , var = 'Age' , target = 'Survived' , row = 'Sex' )
# Drop the target column so that only features remain in `full`
full = full.drop('Survived', axis=1)

In [ ]:
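def getTitles(dframe):
    """Extract each passenger's title from Name and map it to a broader group."""
    # Names look like "Braund, Mr. Owen Harris": the title sits between
    # the comma and the first period
    dframe['Title'] = dframe['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # Collapse rare titles into a handful of groups
    myDict = {
        "Capt":         "Officer",
        "Col":          "Officer",
        "Major":        "Officer",
        "Dr":           "Officer",
        "Rev":          "Officer",
        "Lady":         "Royalty",
        "Jonkheer":     "Royalty",
        "Don":          "Royalty",
        "Sir":          "Royalty",
        "the Countess": "Royalty",
        "Dona":         "Royalty",
        "Mme":          "Mrs",
        "Mlle":         "Miss",
        "Ms":           "Mrs",
        "Mr":           "Mr",
        "Mrs":          "Mrs",
        "Miss":         "Miss",
        "Master":       "Master"
    }

    # Titles that are missing from myDict become NaN
    dframe['Title'] = dframe.Title.map(myDict)
    return dframe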
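In getTitles, any title that is absent from myDict comes out of `.map()` as NaN, which would later need imputing. The sketch below shows one possible variant that sends unknown titles to a fallback group instead. The names `extract_title`, `group_titles`, and the `"Other"` fallback are assumptions made for illustration, the dictionary is abbreviated, and the last sample name is invented to exercise the fallback (the other three appear in this notebook's output).

```python
import pandas as pd

# Abbreviated version of the title groups; extend as needed.
TITLE_GROUPS = {
    "Dona": "Royalty",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
}


def extract_title(name):
    """Return the text between the comma and the first period of a name."""
    return name.split(",")[1].split(".")[0].strip()


def group_titles(names, groups, fallback="Other"):
    """Map raw titles to groups; titles missing from `groups` get `fallback`."""
    raw = names.map(extract_title)
    return raw.map(groups).fillna(fallback)


sample_names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Oliva y Ocana, Dona. Fermina",
    "Peter, Master. Michael J",
    "Doe, Prof. Jane",  # hypothetical name with an unmapped title
])

grouped = group_titles(sample_names, TITLE_GROUPS)
print(grouped.tolist())
```

Whether a catch-all group is preferable to NaN depends on how many passengers end up in it; with very few, the extra category may add little to the model.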

In [57]:
full = getTitles(full)
full.head()


Out[57]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass SibSp Ticket ... FamilyLarge TicketType Title Fare_cat Bad_ticket Young Shared_ticket Ticket_group Fare_eff Fare_eff_cat
0 22.0 U S 7.2500 Braund, Mr. Owen Harris 0 1 3 1 A/5 21171 ... 0 A Mr 0 True True 0 1 7.25000 0
1 38.0 C C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 1 PC 17599 ... 0 P Mrs 1 False False 1 2 35.64165 2
2 26.0 U S 7.9250 Heikkinen, Miss. Laina 0 3 3 0 STON/O2. 3101282 ... 0 S Miss 0 False True 0 1 7.92500 0
3 35.0 C S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 1 113803 ... 0 1 Mrs 1 False False 1 2 26.55000 2
4 35.0 U S 8.0500 Allen, Mr. William Henry 0 5 3 0 373450 ... 0 3 Mr 0 True False 0 1 8.05000 0

5 rows × 37 columns


In [56]:
# plot functions
import pltFunctions as pfunc
train_X, test_X, target_y = pfunc.prepareTrainTestTarget(full)
#train_valid_X = full[ 0:891 ]
#train_valid_y = full.Survived
#test_X = full[ 891: ]
#train_X , valid_X , train_y , valid_y = train_test_split( train_X , train_valid_y , train_size = .7 )

print (full.shape , train_X.shape , target_y.shape , test_X.shape)


(1309, 37) (891, 37) (891,) (418, 37)
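pfunc.prepareTrainTestTarget comes from the local pltFunctions module, which is not shown here. Judging only from the commented-out lines and the printed shapes (891 labelled rows followed by 418 test rows), a helper of this kind could split the stacked frame by row position. The sketch below is a guess along those lines; the name `split_combined`, its arguments, and the toy data are all assumptions and not the notebook's actual helper.

```python
import pandas as pd


def split_combined(features, target, n_train):
    """Split a stacked train+test feature frame back into its two parts.

    Hypothetical helper: the first `n_train` rows are treated as the
    labelled training rows, the remainder as the unlabelled test rows.
    """
    train_part = features.iloc[:n_train]
    test_part = features.iloc[n_train:]
    labels = target.iloc[:n_train]
    if len(labels) != len(train_part):
        raise ValueError("target has fewer labelled rows than n_train")
    return train_part, test_part, labels


# Toy stand-in: 5 labelled rows followed by 3 unlabelled rows.
toy_features = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 3, 2],
    "SibSp":  [1, 1, 0, 1, 0, 0, 1, 0],
})
toy_target = pd.Series([0, 1, 1, 1, 0])

toy_train, toy_test, toy_y = split_combined(toy_features, toy_target, n_train=5)
print(toy_train.shape, toy_test.shape, toy_y.shape)
```

Passing the target separately means the split still works after `Survived` has been dropped from the feature frame.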

In [51]:
model = RandomForestClassifier(n_estimators=100)
#model = SVC()
model.fit( train_X , target_y )


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-ebc6e6e342d3> in <module>()
      1 model = RandomForestClassifier(n_estimators=100)
      2 #model = SVC()
----> 3 model.fit( train_X , target_y )

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
    245         """
    246         # Validate or convert input data
--> 247         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    248         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    249         if issparse(X):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'Mr'
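The traceback shows the cause: scikit-learn converts the feature matrix to a float array, and train_X still contains string columns such as Title. One common remedy is to one-hot encode the categorical columns and drop free-text ones before fitting. The sketch below illustrates that idea on the first five rows shown earlier; the helper name `encode_features` and the choice to drop Name are assumptions, and the real frame has further string columns (for example Cabin, Embarked, Ticket) that would need the same treatment.

```python
import numpy as np
import pandas as pd


def encode_features(df, drop_cols=()):
    """One-hot encode string columns and return an all-float frame.

    Hypothetical helper: columns in `drop_cols` are removed first, then
    pd.get_dummies expands the remaining object/category columns.
    """
    features = df.drop(columns=list(drop_cols))
    encoded = pd.get_dummies(features)
    # Cast booleans and integer indicators to float; this raises if a
    # non-numeric column is still present.
    return encoded.astype(float)


# First five rows of a few columns, as printed in this notebook.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Age":        [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare":       [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Bad_ticket": [True, False, False, False, True],
    "Title":      ["Mr", "Mrs", "Miss", "Mrs", "Mr"],
    "TicketType": ["A", "P", "S", "1", "3"],
})

X = encode_features(toy, drop_cols=["Name"])

# The same kind of conversion that failed in the traceback now succeeds.
as_float = np.asarray(X, dtype=np.float32)
print(X.columns.tolist())
print(as_float.shape)
```

Applied to the combined frame before it is split, the encoding gives the train and test parts identical columns, after which `model.fit( train_X , target_y )` should no longer fail on string values. Any remaining NaN values would still need to be imputed.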
